Efficient Algorithms for Clustering and Classifying High Dimensional Text and Discretized Data using Interesting Patterns

Author

  • Hassan H. Malik
Abstract

Recent advances in data mining allow for exploiting patterns as the primary means for clustering and classifying large collections of data. In this thesis, we present three advances in pattern-based clustering technology, an advance in semi-supervised pattern-based classification, and a related advance in pattern frequency counting. In our first contribution, we analyze numerous deficiencies with traditional pattern significance measures such as support and confidence, and propose a web image clustering algorithm that uses an objective interestingness measure to identify significant patterns, yielding measurably better clustering quality. In our second contribution, we introduce the notion of closed interesting itemsets, and show that these itemsets provide significant dimensionality reduction over frequent and closed frequent itemsets. We propose GPHC, a sub-linearly scalable global pattern-based hierarchical clustering algorithm that uses closed interesting itemsets, and show that this algorithm achieves up to 11% better FScores and up to 5 times better entropies as compared to state-of-the-art agglomerative, partitioning-based, and pattern-based hierarchical clustering algorithms on 9 common datasets. Our third contribution addresses problems associated with using globally significant patterns for clustering. We propose IDHC, a pattern-based hierarchical clustering algorithm that builds a cluster hierarchy without mining for globally significant patterns. IDHC allows each instance to "vote" for its representative size-2 patterns in a way that ensures an effective balance between local and global pattern significance, produces more descriptive cluster labels, and allows a more flexible soft clustering scheme.
Results of experiments performed on 40 standard datasets show that IDHC almost always outperforms state-of-the-art hierarchical clustering algorithms and achieves up to 15 times better entropies, without requiring any tuning of parameter values, even on highly correlated datasets. In our fourth contribution, we propose CPHC, a semi-supervised classification algorithm that uses a pattern-based cluster hierarchy as a direct means for classification. All training and test instances are first clustered together using our instance-driven pattern-based hierarchical clustering algorithm, and the resulting cluster hierarchy is then used directly to classify test instances, eliminating the need to train a classifier on an enhanced training set. For each test instance, we first use the hierarchical structure to identify nodes that contain the test instance, and then use the labels of co-existing training instances, weighing them proportionately to their pattern lengths, to obtain class label(s) for the test instance. Results of experiments performed on 19 standard datasets show that CPHC outperforms a number of existing classification algorithms even with sparse training data. Our final contribution deals with the problem of finding a dataset representation that offers a good space-time tradeoff for fast support (i.e., frequency) counting and also automatically identifies transactions that contain the query itemset. We compare FP Trees and Compressed Patricia Tries against several novel variants of vertical bit vectors. We compress vertical bit vectors using WAH encoding and show that simple lexicographic ordering may outperform the Gray code rank-based transaction reordering scheme in terms of RLE compression. These observations lead us to propose HDO, a novel Hamming-distance-based greedy transaction reordering scheme, and aHDO, a linear-time approximation to HDO. 
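As a rough illustration of the vertical bit vector idea mentioned above, the following minimal sketch stores one bit vector per item (bit i set when transaction i contains the item) and answers a support query with a bitwise AND, which also yields the ids of the matching transactions. It is not the thesis implementation: the function names `build_vertical_bitvectors` and `support`, the toy transactions, and the use of plain Python integers as uncompressed bit vectors are all assumptions made for this example, and no WAH encoding is applied.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_vertical_bitvectors(transactions: List[Set[str]]) -> Dict[str, int]:
    """Map each item to an int whose bit i is set when transaction i contains it."""
    vectors: Dict[str, int] = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vectors[item] = vectors.get(item, 0) | (1 << tid)
    return vectors


def support(vectors: Dict[str, int], itemset: Iterable[str],
            n_transactions: int) -> Tuple[int, List[int]]:
    """Return (support count, ids of transactions containing every item)."""
    # Start with all bits set: every transaction contains the empty itemset.
    mask = (1 << n_transactions) - 1
    for item in itemset:
        # An item never seen in the data has an all-zero vector.
        mask &= vectors.get(item, 0)
    tids = [tid for tid in range(n_transactions) if (mask >> tid) & 1]
    return len(tids), tids


# Hypothetical toy dataset, used only for illustration.
transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "c", "d"}]
vectors = build_vertical_bitvectors(transactions)
count, tids = support(vectors, {"a", "c"}, len(transactions))
print(count, tids)  # 3 [0, 1, 3]
```

The surviving bits of the AND result serve both purposes named in the abstract: their number is the support, and their positions identify the transactions that contain the query itemset.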
We present results of experiments performed on 15 common datasets with varying degrees of sparseness, and show that HDO-reordered, WAH encoded bit vectors may take as little as 5% of the uncompressed space, while aHDO achieves similar compression on sparse datasets. With results from over 10 database and data mining style frequency query executions, we show that bitmap-based approaches result in up to 10 times faster support counting, and that HDO-WAH encoded bitmaps offer the best space-time tradeoff.
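To make the link between transaction order and run-length compressibility concrete, here is a small sketch of a generic nearest-neighbour reordering by Hamming distance. It should not be read as the HDO or aHDO algorithm from the thesis, whose details are not given in this abstract; the helper names `hamming`, `greedy_reorder`, and `total_runs`, the toy rows, and the quadratic-time greedy strategy are assumptions chosen for illustration. Each transaction is an integer bitmask over items, and `total_runs` counts runs of identical bits down each item column, a simple proxy for how well the vertical bit vectors would compress under RLE-style schemes.

```python
from typing import List


def hamming(a: int, b: int) -> int:
    """Number of item positions in which two transaction bitmasks differ."""
    return bin(a ^ b).count("1")


def greedy_reorder(rows: List[int]) -> List[int]:
    """Start from the first row and repeatedly append the closest remaining row."""
    if not rows:
        return []
    ordered = [rows[0]]
    remaining = list(rows[1:])
    while remaining:
        last = ordered[-1]
        # Ties are broken by position: min() returns the first closest row.
        idx = min(range(len(remaining)), key=lambda i: hamming(last, remaining[i]))
        ordered.append(remaining.pop(idx))
    return ordered


def total_runs(rows: List[int], n_items: int) -> int:
    """Sum, over item columns, of the number of runs of identical bits."""
    runs = 0
    for col in range(n_items):
        bits = [(row >> col) & 1 for row in rows]
        if bits:
            runs += 1 + sum(1 for x, y in zip(bits, bits[1:]) if x != y)
    return runs


# Hypothetical toy transactions over 4 items, used only for illustration.
rows = [0b1100, 0b0011, 0b1101, 0b0010, 0b1100]
reordered = greedy_reorder(rows)
print(total_runs(rows, 4), total_runs(reordered, 4))  # 18 9
```

On this toy input the reordering halves the number of runs, because similar transactions end up adjacent and each item column changes value less often. The greedy pass shown here is quadratic in the number of transactions.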

Related resources

High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...

High Performance Implementation of Fuzzy C-Means and Watershed Algorithms for MRI Segmentation

Image segmentation is one of the most common steps in digital image processing. Many image segmentation algorithms (e.g., thresholding, edge detection, and region growing) are employed for classifying a digital image into different segments. In this connection, finding a suitable algorithm for medical image segmentation is a challenging task, mainly due to the noise, low contrast, and steep...

Assessment of the Performance of Clustering Algorithms in the Extraction of Similar Trajectories

In recent years, the rapid growth of spatial trajectory data, together with the need to process it and extract useful information and meaningful patterns, has drawn many researchers to the field of spatio-temporal trajectory clustering. The processing and analysis of these trajectories have resulted in the extraction of useful information whic...

A Hybrid Data Clustering Algorithm Using Modified Krill Herd Algorithm and K-MEANS

Data clustering is the process of partitioning a set of data objects into meaningful clusters or groups. Because clustering algorithms are widely used in many fields, a lot of research is still going on to find the best and most efficient clustering algorithm. K-means is simple and easy to implement, but it is sensitive to the initialization of cluster centers and hence can become trapped in a local optimum. In this paper...

Publication date: 2008